System and method for generating an identification signal for electronic devices

ABSTRACT

A system and method for creating a ring tone for an electronic device takes as input a phrase sung in a human voice and transforms it into a control signal controlling, for example, a ringer on a cellular telephone. Time-varying features of the input signal are analyzed to segment the signal into a set of discrete notes and to assign to each note a chromatic pitch value. The set of note start times, stop times, and pitches is then translated into a format suitable for controlling the device.

CROSS-REFERENCE TO RELATED APPLICATIONS

This application is a division of and claims priority under 35 U.S.C. §120 to U.S. application Ser. No. 10/037,097, filed Dec. 31, 2001, the entire contents of which are hereby fully incorporated by reference.

FIELD OF THE INVENTION

This invention relates generally to personal electronic devices, and more particularly to generating personalized ring tones for personal electronic devices such as cellular telephones.

BACKGROUND OF THE INVENTION

It is desirable to personalize the presentation of portable electronic appliances, either to distinguish one appliance from other similar appliances where they might otherwise be confused or simply to conform the presentation of an appliance to its owner's personal preference. Current mobile telephones, for example, provide options for customizing the ring tone sequence, giving the user a choice of a sequence that is pleasant to the user's ear, suits the user's style, and is unique to the user's personality. The proliferation of affordable mobile handsets and services has created an enormous market opportunity for wireless entertainment and voice-based communication applications, with a consumer base that is an order of magnitude larger than the personal computer user base.

Although pre-existing sequences of ring tones can be downloaded from a variety of websites, many users wish to create a unique ring tone sequence. Current applications for creating customized ring tone sequences are limited by the facts that people with musical expertise must create them and that the users must have Internet access (in addition to the mobile handset).

The current methods for generating, sending, and receiving ring tone sequences involve four basic functions. The first function is the creation of the ring tone sequence. The second function is the formatting of the ring tone sequence for delivery. The third function is the delivery of the ring tone sequence to a particular handset. The fourth function is the playback of the ring tone sequence on the handset. Current methodologies are limited in the first step of the process by the lack of available options in the creation step. All methodologies must follow network protocols and standards for functions two and three for the successful completion of any custom ring tone system. Functions two and three could be collectively referred to as delivery but are distinctly different processes. The fourth function depends on the hardware capabilities of the specific handset, which vary by manufacturer and by the country in which the handset is sold.

Current methods for the creation of ring tone sequences involve some level of musical expertise. The most common way to purchase a custom ring tone sequence is to have someone compose or duplicate a popular song, post the file to a commercial Web site service, preview the ring tone sequence, and then purchase the selection. This is currently a very popular method, but it is limited by the requirement of an Internet connection to preview the ring tone sequences. It also requires the musical expertise of someone else to generate the files.

Another common system for the creation of ring tone sequences is to key a sequence of codes and symbols manually, directly into the handset. Typically, these sequences are available on various Internet sites and user forums. Again, this is limited to users with an Internet connection and the diligence to find these sequences and input them properly.

A third method involves using tools, available through commercial services and handset manufacturer Web sites, that allow the user to generate a ring tone sequence by creating notes and sounds in a composition setting such as a musical score. This requires even greater musical expertise because it is essentially composing a song note by note. It also involves the use of an Internet connection.

Another method of creating a ring tone is to translate recorded music into a sequence of tones. A number of problems arise when attempting to translate recorded music into a ring tone sequence for an electronic device. The translation process generally requires segmentation and pitch determination. Segmentation is the process of determining the beginning and the end of a note. Prior art systems for segmenting notes in recordings of music rely on various techniques to determine note beginning points and end points. Techniques for segmenting notes include energy-based segmentation methods as disclosed in L. Rabiner and R. Schafer, "Digital Processing of Speech Signals," Prentice Hall, 1978, pp. 120-135, and L. Rabiner and B. H. Juang, "Fundamentals of Speech Recognition," Prentice Hall, New Jersey, 1993, pp. 143-149; voicing probability-based segmentation methods as disclosed in L. Rabiner and R. Schafer, "Digital Processing of Speech Signals," Prentice Hall, 1978, pp. 135-139, 156, 372-373, and T. F. Quatieri, "Discrete-Time Speech Signal Processing: Principles and Practice," Prentice Hall, New Jersey, 2002, pp. 516-519; and statistical methods based on stationarity measures or Hidden Markov models as disclosed in C. Raphael, "Automatic Segmentation of Acoustic Musical Signals Using Hidden Markov Models," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 21, no. 4, 1999, pp. 360-370. Once the note beginning and end points have been determined, the pitch of that note over its entire duration must be determined. A variety of techniques for estimating the pitch of an audio signal are available, including autocorrelation techniques, cepstral techniques, wavelet techniques, and statistical techniques as disclosed in L. Rabiner and R. Schafer, "Digital Processing of Speech Signals," Prentice Hall, 1978, pp. 135-141, 150-161, 372-378; T. F. Quatieri, "Discrete-Time Speech Signal Processing: Principles and Practice," Prentice Hall, New Jersey, 2002, pp. 504-516; and C. Raphael, "Automatic Segmentation of Acoustic Musical Signals Using Hidden Markov Models," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 21, no. 4, 1999, pp. 360-370. Using any of these techniques, the pitch can be measured at several times throughout the duration of a note. The resulting sequence of pitch estimates must then be reduced to a single pitch (frequency) assigned to the note, which is difficult because pitch estimates vary considerably over the duration of a note. This is true of most acoustic instruments and especially of the human voice, which is characterized by multiple harmonics, vibrato, aspiration, and other qualities that make the assignment of a single pitch quite difficult.

It is desirable to have a system and method for creating a unique ring tone sequence for a personal electronic device that does not require musical expertise or programming tasks.

It is an object of the present invention to provide a system and apparatus to transform an audio recording into a sequence of discrete notes and to assign to each note a duration and frequency from a set of predetermined durations and frequencies.

It is another object of the present invention to provide a system and apparatus for creating custom ring tone sequences by transforming a person's singing, or any received song that has been sung, into a ring tone sequence for delivery and use on a mobile handset.

SUMMARY OF THE INVENTION

The problems of creating an individualized identification signal for electronic devices are solved by the present invention of a system and method for generating a ring tone sequence from a monophonic audio input.

The present invention is a digital signal processing system for transforming monophonic audio input into a representation suitable for creating a ring tone sequence for a mobile device. It includes a method for estimating note start times and durations and a method for assigning a chromatic pitch to each note.

A data stream module samples and digitizes an analog vocalized signal, divides the digitized samples into segments called frames, and stores the digital samples for a frame into a buffer.

A primary feature estimation module analyzes each buffered frame of digitized samples to produce a set of parameters that represent salient features of the voice production mechanism. The analysis is the same for each frame. The parameters produced by the preferred embodiment are a series of cepstral coefficients, a fundamental frequency, a voicing probability, and an energy measure.

A secondary feature estimation module computes a representation of the average change of the parameters produced by the primary feature estimation module.

A tertiary feature estimation module creates ordinal vectors that encode the number of frames, both forward and backward, over which the direction of change encoded by the secondary feature estimation module remains the same.

Using the primary, secondary, and tertiary features, a two-phase segmentation module produces estimates of the starting and ending frames for each segment. Each segment corresponds to a note. The first phase of the two-phase segmentation module categorizes the frames into regions of upward energy followed by downward energy using the tertiary feature vectors. The second phase of the two-phase segmentation module looks for significant changes in the primary and secondary features over the categorized frames of successive upward and downward energy to determine the starting and ending frames for each segment.

Finally, after the segments have been determined, a pitch estimation module provides an estimate of each note's pitch based primarily on the fundamental frequency as determined by the primary feature estimation module.

A ring tone sequence generation module uses each note's start time, duration, end time, and pitch to generate a representation adequate for generating a ringing tone sequence on a mobile device. In the preferred embodiment, the ring tone sequence generation module produces output written in accordance with the smart messaging specification (SMS) ringing tone syntax, a part of the Global System for Mobile Communications (GSM) standard. The output may also be in Nokia Ring Tone Transfer Language; Enhanced Messaging Service (EMS), which is a standard developed by the Third Generation Partnership Project (3GPP); iMelody, which is a standard for defining sounds within EMS; Multimedia Messaging Service (MMS), which is standardized by 3GPP; WAV, which is a format for storing sound files supported by Microsoft Corporation and by IBM Corporation; or musical instrument digital interface (MIDI), which is the standard adopted by the electronic music industry. These outputs are suitable for being transmitted via the smart messaging specification.

The present invention, together with the above and other advantages, may best be understood from the following detailed description of the embodiments of the invention illustrated in the drawings, wherein:

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a block diagram of a telephone-based song processing and transmission system according to principles of the invention;

FIG. 2A is a block diagram of a ring tone sequence subsystem of FIG. 1;

FIG. 2B is a block diagram of the primary feature parameters for a given frame whose values are generated by the primary feature estimation module of FIG. 2A;

FIG. 2C is a block diagram of the secondary feature parameters for a given frame whose values are generated by the secondary feature estimation module of FIG. 2A;

FIG. 2D is a block diagram of the tertiary feature parameters for a given frame whose values are generated by the tertiary feature estimation module of FIG. 2A;

FIG. 3 is a block diagram of the two-phase segmentation modules in accordance with the present invention;

FIG. 4 is a part block diagram, part flow diagram of the operation of the pitch assignment module including the intranote pitch assignment subsystem and the internote pitch assignment subsystem of FIG. 1;

FIG. 5 is a part block diagram, part flow diagram of the operation of the intranote pitch assignment subsystem of FIG. 4;

FIG. 6 is a part block diagram, part flow diagram of the operation of the internote pitch assignment subsystem of FIG. 5; and

FIG. 7 is a block diagram of a networked computer implementation of the system of FIG. 1.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

FIG. 1 is a block diagram of a system 10 suitable for accepting an input of a monophonic audio signal. In a first alternative embodiment of the invention, the monophonic audio signal is a vocalized song. The system 10 provides an output of information for programming a corresponding ring tone for mobile telephones according to principles of the present invention. The system 10 has a telephony (or mobile) call handler 50, a ring tone sequence application 40 that transforms vocal input in accordance with the present invention, and an SMS handler 30. An input signal 5 from a source 2 is received at the call handler 50 for voice capture. The input signal is of limited duration, typically lasting between 5 and 60 seconds, although signals of shorter or longer duration are possible. The voice signal is digitized and then transmitted to the ring tone sequence subsystem 40. While the input source shown here is an analog receiver such as an analog telephone, the input could also be received from an analog-to-digital signal transducer. Further, instead of being received over a telephone network, the input signal could instead be received at a kiosk or over the Internet.

The ring tone sequence subsystem 40 analyzes the digitized voice signal 15, represents it by salient parameters, segments the signal, estimates a pitch for each segment, and produces a note-based sequence 25. The SMS handler 30 processes the note-based sequence 25 and transmits an SMS containing the ring tone representation 35 of discrete tones to a portable device 55 having the capability of "ringing," such as a cellular telephone. The ring tone representation results in an output from the "ringing" device of a series of tones recognizable to the human ear as a translation of the vocal input.

Ring Tone Sequence Subsystem

FIG. 2A is a block diagram of the ring tone sequence subsystem 40 of FIG. 1. FIG. 2A illustrates in greater detail the main components of the ring tone sequence subsystem 40 and the component interconnections. The ring tone sequence subsystem 40 has a data stream module 100, a primary feature estimation module 120, a secondary feature estimation module 130, a tertiary feature estimation module 140, a segmentation module 300 comprising a first-phase segmentation module 150 and a second-phase segmentation module 160, an intranote pitch assignment subsystem 170, and an internote pitch assignment subsystem 180.

In the data stream module 100, signal preprocessing is first applied, as known in the art, to facilitate encoding of the input signal. As is customary in the art, the digitized acoustic signal, x, is next divided into overlapping frames. The framing of the digital signal is characterized by two values: the frame rate in Hz (or the frame increment in seconds, which is simply the inverse of the frame rate) and the frame width in seconds. In a preferred embodiment of the invention, the acoustic signal is sampled at 8,000 Hz and is enframed using a frame rate of 100 Hz and a frame width of 36.4 milliseconds. In a preferred embodiment, the separation of the input signal into frames is accomplished using a circular buffer having a size of 291 sample storage slots. In other embodiments the input signal buffer may be a linear buffer or other data structure. The framed signal 115 is output to the primary feature estimation module 120.
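The framing step can be summarized in a few lines of code. The following is a minimal sketch, not the patent's implementation: it uses the preferred-embodiment values above (8 kHz sampling, 100 Hz frame rate, 36.4 ms frame width) and substitutes simple array slicing for the circular buffer; the function name `enframe` is illustrative.

```python
import numpy as np

# Assumed values from the preferred embodiment described above.
FS = 8000                      # sampling rate, Hz
FRAME_RATE = 100               # frames per second
HOP = FS // FRAME_RATE         # 80 samples between frame starts
WIDTH = round(0.0364 * FS)     # 291 samples per frame

def enframe(x: np.ndarray) -> np.ndarray:
    """Divide signal x into overlapping frames, one row per frame."""
    n_frames = 1 + (len(x) - WIDTH) // HOP
    return np.stack([x[i * HOP : i * HOP + WIDTH] for i in range(n_frames)])
```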

Primary Feature Estimation Module

The primary feature estimation module 120, shown in FIG. 2A, produces a set of time varying primary features 125 for each frame of the digitized input signal 15. FIG. 2B depicts a "primary data structure" 125A used to store the primary features 125 for one frame of the digitized input signal 15. The primary features generated by the primary feature estimation module 120 for each frame and stored in the primary data structure 125A are:

-   time-domain energy measure, E, 226
-   fundamental frequency, f₀, 222
-   cepstral coefficients, {c₀, c₁}, 220
-   cepstral-domain energy measure, e, 228
-   voicing probability, v, 224

The primary features are extracted as follows. The input is the digitized signal, x, which is a discrete-time signal that represents an underlying continuous waveform produced by the voice or another instrument capable of producing an acoustic signal and therefore a continuous waveform. The primary features are extracted from each frame. Let x[n] represent the value of the signal at sample n. The time at sample n relative to the beginning of the signal, n=0, is n/f_s, where f_s is the sampling frequency in Hz. Let F(i) represent the index set of all n in frame i, and N_F the number of samples in each frame.

The time-domain energy measure is extracted from frame i according to the formula

$$E[i] = \frac{1}{N_F} \sum_{m \in F(i)} \bigl[\, w(m-i)\,\bigl(x[m] - \bar{x}\bigr) \bigr]^2 \qquad (1)$$

where $\bar{x}$ is the mean of x[m] for all m ∈ F(i) and w is a window function. Equation 1 states that the time-domain energy measure 226 is extracted by multiplying the mean-removed signal by the window, summing the square of the result, and normalizing by the number of samples in the frame. The window w is unimodal: it reaches a maximum at the center of the frame and a minimum at the beginning and end of the frame. The preferred embodiment uses a Hamming window. Other types of windows that may be used include a Hanning window, a Kaiser window, a Blackman window, a Bartlett window, and a rectangular window.
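Equation (1) reduces to a short computation per frame. The sketch below is illustrative only (the function name is an assumption) and uses the Hamming window of the preferred embodiment:

```python
import numpy as np

def time_domain_energy(frame: np.ndarray) -> float:
    """Equation (1): windowed, mean-removed energy of one frame."""
    w = np.hamming(len(frame))          # preferred embodiment's window
    centered = frame - frame.mean()     # remove the frame mean
    return np.sum((w * centered) ** 2) / len(frame)
```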

The fundamental frequency 222 is estimated by looking for periodicity in x. The fundamental frequency at frame i is calculated by estimating the longest period in frame i, T₀[i], and taking its inverse,

$$f_0[i] = \frac{1}{T_0[i]} \qquad (2)$$

In the preferred embodiment, f₀[i] is calculated using frequency domain techniques. Pitch detection techniques are well known in the art and are described, for example, in L. Rabiner and R. Schafer, "Digital Processing of Speech Signals," Prentice Hall, 1978, pp. 135-141, 150-161, 372-378, and T. F. Quatieri, "Discrete-Time Speech Signal Processing: Principles and Practice," Prentice Hall, New Jersey, 2002, pp. 504-516. The cepstral coefficients 220 are extracted using the complex cepstrum by computing the inverse discrete Fourier transform of the complex natural logarithm of the short-time discrete Fourier transform of the windowed signal. The short-time discrete Fourier transform is computed using techniques customary in the prior art. Let X[i,k] be the discrete Fourier transform of the windowed signal, which is computed according to the formula

$$X[i,k] = \sum_{m \in F'(i)} w(m-i)\,\bigl(x[m] - \bar{x}\bigr)\, e^{-j 2\pi m k / N} \qquad (3)$$

where N is the size of the discrete Fourier transform and F′(i) is F(i) with N − N_F zeros added.

The cepstral coefficients are computed from the discrete Fourier transform of the natural logarithm of X[i,k] as

$$c_m[i] = \sum_{k=0}^{N-1} \log X[i,k]\; e^{j 2\pi m k / N} \qquad (4)$$

where

$$\log X[i,k] = \log\lvert X[i,k]\rvert + j\,\mathrm{Angle}\bigl(X[i,k]\bigr) \qquad (5)$$

and where Angle(X[i,k]) is the angle between the real and imaginary parts of X[i,k]. In the preferred embodiment, the primary features include the first two cepstral coefficients, i.e., c_m[i] for m ∈ {0, 1}. Cepstral coefficients, derived from the inverse Fourier transform of the log magnitude spectrum generated from a short-time Fourier transform of one frame of the input signal, are well known in the art and are described, for example, in L. Rabiner and B. H. Juang, "Fundamentals of Speech Recognition," Prentice Hall, New Jersey, 1993, pp. 143-149, which is hereby incorporated by reference as background information.
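A compact sketch of equations (3) and (4) follows. Note the simplification: this sketch uses the real cepstrum (log magnitude only), whereas the patent's complex cepstrum also retains the phase term of equation (5); the FFT size and function name are assumptions.

```python
import numpy as np

def cepstral_coefficients(frame: np.ndarray, n_fft: int = 512, n_coef: int = 2):
    """Real-cepstrum approximation of equations (3)-(4): returns c0..c_{n_coef-1}."""
    w = np.hamming(len(frame))
    X = np.fft.fft(w * (frame - frame.mean()), n_fft)   # eq. (3), zero-padded
    log_mag = np.log(np.abs(X) + 1e-12)                 # magnitude term of eq. (5)
    c = np.fft.ifft(log_mag).real                       # eq. (4), real part
    return c[:n_coef]                                   # c0, c1
```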

The cepstral-domain energy measure 228 is extracted according to the formula

$$e[i] = \frac{c_0[i] - \bar{c}_0}{\max_{i'}\bigl(c_0[i']\bigr)} \qquad (6)$$

The cepstral-domain energy measure represents the short-time cepstral gain with the mean value removed, normalized by the maximum gain over all frames.

The voicing probability measure 224 is defined as the point between the voiced and unvoiced portions of the frequency spectrum for one frame of the signal. A voiced signal is defined as a signal that contains only harmonically related spectral components, whereas an unvoiced signal does not contain harmonically related spectral components and can be modeled as filtered noise. In the preferred embodiment, if v=1 the frame of the signal is purely voiced; if v=0, the frame of the signal is purely unvoiced.

Secondary Feature Estimation Module

The secondary feature estimation module 130, shown in FIG. 2A, produces a set of time varying secondary features 135 based on the primary features 125. FIG. 2C depicts a "secondary data structure" 135A used to store the secondary features 135 for one frame of the digitized input signal 15. The secondary feature estimation module 130 generates secondary features by taking short-term averages of the primary features 125 output from the primary feature estimation module 120. Short-term averages are typically taken over 2-10 frames; in a preferred embodiment, they are computed over three consecutive frames (see the sketch following the list below). Secondary features generated for each frame and stored in the secondary data structure 135A are:

-   short-term average change in time-domain energy E, $\overline{\Delta E}$, 242
-   short-term average change in fundamental frequency f₀, $\overline{\Delta f_0}$, 236
-   short-term average change in cepstral coefficient c₁, $\overline{\Delta c_1}$, 232
-   short-term average change in cepstral-domain energy e, $\overline{\Delta e}$, 240
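The sketch below illustrates one plausible reading of a secondary feature, assuming "change" means the frame-to-frame difference of a primary feature (the patent does not spell this out); the function name is illustrative.

```python
import numpy as np

def short_term_average_change(feature: np.ndarray, span: int = 3) -> np.ndarray:
    """Change in a primary feature, averaged over `span` consecutive frames
    (three in the preferred embodiment). `feature` holds one value per frame."""
    delta = np.diff(feature, prepend=feature[0])        # frame-to-frame change
    kernel = np.ones(span) / span
    return np.convolve(delta, kernel, mode="same")      # short-term average
```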

Tertiary Feature Estimation Module

The tertiary feature estimation module 140, shown in FIG. 2A, produces a set of time varying tertiary features 145 based on two of the secondary features 135. FIG. 2D depicts a "tertiary data structure" 145A used initially to store the tertiary features 145 for one frame of the digitized input signal 15. The tertiary feature estimation module 140 generates tertiary features that represent the number of consecutive frames for which a given secondary feature 135 changed in the same direction. Tertiary features generated for each frame and stored in the tertiary data structure 145A are:

-   count of consecutive upward short-term average changes in cepstral-domain energy e, N($\overline{\Delta e}$ > 0), 244
-   count of consecutive downward short-term average changes in cepstral-domain energy e, N($\overline{\Delta e}$ < 0), 246
-   count of consecutive upward short-term average changes in fundamental frequency f₀, N($\overline{\Delta f_0}$ > 0), 248
-   count of consecutive downward short-term average changes in fundamental frequency f₀, N($\overline{\Delta f_0}$ < 0), 250

In the preferred embodiment, counters N(a) are provided for each frame for each of the four tertiary features, where the argument a is the condition being tested. A counter increments on each frame for which its argument holds and is reset whenever the argument a is false. The value of N(a) therefore depends on both the current frame and the particular feature being counted. For example, the argument of N($\overline{\Delta f_0}$ > 0) is false at a given frame when the value of the short-term average change in f₀ at that frame is not greater than zero.
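A tertiary counter is simply a run-length count of a boolean condition. The following sketch (illustrative name, not from the patent) computes one such counter for every frame:

```python
import numpy as np

def run_length_counter(condition: np.ndarray) -> np.ndarray:
    """For each frame, the number of consecutive frames (including this one)
    on which `condition` has held; the count resets to zero when it is false."""
    counts = np.zeros(len(condition), dtype=int)
    for i, ok in enumerate(condition):
        counts[i] = counts[i - 1] + 1 if ok and i > 0 else int(ok)
    return counts

# e.g. the 244 feature, per frame:  n_up = run_length_counter(avg_delta_e > 0)
```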

Two-Phase Segmentation Module

FIG. 3 is a block diagram of the two-phase segmentation module 300 including the first-phase segmentation module 150 and the second-phase segmentation module 160, shown in FIG. 2A. The first-phase segmentation module 150 groups successive frames into regions based on two of the tertiary features 145. A region is a set of frames in which the change in energy increases, immediately followed by frames in which the change in energy decreases. Specifically, the tertiary features N($\overline{\Delta e}$ > 0), 244, and N($\overline{\Delta e}$ < 0), 246, are used to group successive frames into regions. A region, in order to be valid, must have at least a minimum number of frames, for example 10 frames. A region is defined in this way because a valid start frame, i.e., a note start, is a transitory event occurring when energy is in flux. That is, a note does not start when the energy is flat, or when it is decreasing, or when it is continually increasing. A note start is generally characterized by an increase in energy followed by an immediate decrease in the change in energy. Typically there are 4-12 frames of increasing energy followed by 10-35 frames of decreasing energy.

Each region starts when a given frame of N($\overline{\Delta e}$ > 0), 244, contains a non-zero count and the previous frame contains a zero. For each region determined by the first-phase segmentation module 150, a candidate note start frame is estimated. Within the region, the candidate start frame is determined as the last frame within the region in which the tertiary feature N($\overline{\Delta e}$ > 0), 244, contains a non-zero count. The second-phase segmentation module 160 then determines which regions contain valid note start frames. Valid note start frames are determined by selecting all regions estimated by the first-phase segmentation module 150 that contain significant correlated change within the region.
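The first phase can be sketched as follows. This is an illustrative fragment under stated assumptions: `n_up` and `n_down` are the counters N($\overline{\Delta e}$ > 0) and N($\overline{\Delta e}$ < 0) per frame, and the function name and return layout are not from the patent.

```python
import numpy as np

def first_phase_regions(n_up: np.ndarray, n_down: np.ndarray, min_frames: int = 10):
    """Group frames into regions of rising then falling cepstral-domain energy.
    Returns (region_start, candidate_note_start, region_end) frame triples."""
    regions = []
    i = 0
    while i < len(n_up):
        if n_up[i] > 0 and (i == 0 or n_up[i - 1] == 0):   # region begins
            start = i
            while i < len(n_up) and n_up[i] > 0:           # rising-energy frames
                i += 1
            candidate = i - 1     # last frame with a non-zero upward count
            while i < len(n_down) and n_down[i] > 0:       # falling-energy frames
                i += 1
            if i - start >= min_frames:                    # validity check
                regions.append((start, candidate, i - 1))
        else:
            i += 1
    return regions
```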

The second-phase segmentation module 160 uses three threshold-based criteria for determining which regions and their corresponding start frames actually represent starting note boundaries. The first criterion is based on the primary feature that is the cepstral-domain energy measure e. Each frame within a valid region, as determined by the first-phase segmentation process, is evaluated. A frame within a valid region is marked if its cepstral-domain energy is greater than a cepstral-domain energy threshold and that of the previous frame is less than the threshold. An example value of the cepstral-domain energy threshold is 0.0001. If a valid region has any marked frames, the corresponding start frame based on N($\overline{\Delta e}$ > 0) is chosen as a start frame representing an actual note boundary.

The second and third criteria use parameters to select whether a frame within a valid region R is marked. The parameter used by the second criterion, referred to herein as the fundamental frequency range and denoted by Range(f₀[i], R), is calculated according to

$$\mathrm{Range}\bigl(f_0[i], R\bigr) = \max_{i \in R}\bigl(f_0[i]\bigr) - \min_{i \in R}\bigl(f_0[i]\bigr).$$

An example fundamental frequency range threshold is 0.45 MIDI note numbers. Equation 7, below, provides the conversion from hertz to MIDI note number.

The parameter used by the third criterion, referred to herein as the energy range and denoted by Range(e[i], R), is calculated similarly. An example value of the energy range threshold is 0.2.

The candidate note start frame within a valid region is chosen as a start frame representing an actual note boundary if the cepstral-domain energy criterion described above is met, or if both the fundamental frequency range and the energy range exceed their thresholds.

For each start frame resulting from the three criteria described above, a corresponding stop frame of the note boundary is found by selecting the first frame occurring after the start frame in which the primary feature e drops below the cepstral-domain energy threshold. In the preferred embodiment, if e does not drop below the cepstral-domain energy threshold on a frame prior to the next start frame, the stop frame is taken to be a predefined number of frames before the next start frame. In the preferred embodiment of the invention, this stop frame is between 1 and 10 frames before the next start frame.

The output of the Two-Phase Segmentation Module is a list of note start and stop frames.
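The second phase can be sketched as below. This fragment consumes the regions produced by the first-phase sketch above; the thresholds are the example values quoted in the text, f₀ is assumed to already be expressed in MIDI note numbers (via Equation 7, below) so the 0.45 threshold applies, and the `backoff` parameter and function name are assumptions.

```python
import numpy as np

def second_phase_notes(regions, e, f0, e_thresh=0.0001, f0_range_thresh=0.45,
                       e_range_thresh=0.2, backoff=5):
    """Select valid note starts from first-phase regions and pair each with
    a stop frame. `e` is cepstral-domain energy per frame; `f0` is the
    fundamental frequency per frame in MIDI note numbers."""
    starts = []
    for r_start, candidate, r_end in regions:
        span = slice(r_start, r_end + 1)
        # criterion 1: e crosses the threshold upward within the region
        crossed = any(e[i] > e_thresh and e[i - 1] < e_thresh
                      for i in range(max(r_start, 1), r_end + 1))
        # criteria 2 and 3: both ranges exceed their thresholds
        wide = (f0[span].max() - f0[span].min() > f0_range_thresh and
                e[span].max() - e[span].min() > e_range_thresh)
        if crossed or wide:
            starts.append(candidate)
    notes = []
    for j, s in enumerate(starts):
        limit = starts[j + 1] if j + 1 < len(starts) else len(e)
        # first frame after the start where e falls below the threshold,
        # else back off a few frames before the next start
        stop = next((i for i in range(s + 1, limit) if e[i] < e_thresh),
                    max(s + 1, limit - backoff))
        notes.append((s, stop))
    return notes
```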

In the preferred embodiment, a segmentation post-processor 166 is used to verify the list of note start and stop frames. For each note, which consists of all frames between a pair of start and stop frames, three values are calculated: the average voicing probability v, the average short-time energy e, and the average fundamental frequency. These values are used to check whether the corresponding note should be removed from the list. For example, in the preferred embodiment, if the average voicing probability for a note is less than 0.12, the note is classified as a "breath" sound or "noise" and is removed from the list, since it is not considered a "musical" note. Also, for example, in the preferred embodiment, if the average energy e is less than 0.0005, the note is likewise considered "non-musical" and is classified as "noise" or an "un-intentional sound."
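A sketch of the post-processor follows. Only the two thresholds quoted in the text are applied; the fundamental frequency check is omitted because no example threshold is given, and the function name is illustrative.

```python
import numpy as np

def postprocess_notes(notes, v, e, v_thresh=0.12, e_thresh=0.0005):
    """Drop notes whose average voicing probability or average energy
    marks them as breath sounds or noise (example thresholds from the text)."""
    kept = []
    for start, stop in notes:
        frames = slice(start, stop + 1)
        avg_v = float(np.mean(v[frames]))
        avg_e = float(np.mean(e[frames]))
        if avg_v >= v_thresh and avg_e >= e_thresh:   # a "musical" note
            kept.append((start, stop))
    return kept
```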

Pitch Assignment Module

FIG. 4 shows the process of the pitch assignment module including the intranote pitch assignment subsystem 170 and the internote pitch assignment subsystem 180 of FIG. 1. The Pitch Assignment Module accepts as input the output of the Two-Phase Segmentation Module and the Primary Feature Estimation Module, and assigns a single pitch to each note detected by the Two-Phase Segmentation Module, step 190. This output is first sent to the intranote pitch assignment subsystem, step 200. Output from the intranote pitch assignment subsystem is then sent to the internote pitch assignment subsystem, step 205. The Intranote Pitch Assignment Subsystem 170 and the Internote Pitch Assignment Subsystem 180 determine the assigned pitch for each note in the score. The major difference between these two subsystems is that the Intranote Pitch Assignment Subsystem does not use contextual information (i.e., features corresponding to prior and future notes) to assign MIDI note numbers to notes, whereas the Internote Pitch Assignment Subsystem does make use of contextual information from other notes in the score. The output of the pitch assignment module is a final score data structure, step 210. The score data structure includes the starting frame number, the ending frame number, and the assigned pitch for each note in the sequence. The assigned pitch for each note is an integer between 32 and 83 that corresponds to the Musical Instrument Digital Interface (MIDI) note number.

The set of primary features between and including the starting and ending frame numbers is used to determine the assigned pitch for each note as follows. Let S_j denote the set of frame indices between and including the starting and ending frames for note j. The set of fundamental frequency estimates within note j is denoted by {f₀[i], ∀i ∈ S_j}.

FIG. 5 shows the operation of the intranote pitch assignment subsystem 170. The Intranote Pitch Assignment Subsystem consists of four processing stages: the Energy Thresholding Stage 201, the Voicing Thresholding Stage 202, the Statistical Processing Stage 203, and the Pitch Quantization Stage 204. The Energy Thresholding Stage removes from S_j fundamental frequency estimates with corresponding time-domain energies less than a specified energy threshold (for example, 0.1) and creates a modified frame index set S_j^E. The Voicing Thresholding Stage removes from S_j^E fundamental frequency estimates with corresponding voicing probabilities less than a specified voicing probability threshold and creates a modified frame index set S_j^EV. An example value of the voicing probability threshold is 0.5. The Statistical Processing Stage computes the median and mode of {f₀[i], ∀i ∈ S_j^EV} and classifies {f₀[i], ∀i ∈ S_j^EV} into one or more distributional types with a corresponding confidence estimate for the classification decision. Distributional types may be determined through clustering as described in K. Fukunaga, "Introduction to Statistical Pattern Recognition," 2nd ed., Academic Press, 1990, p. 510. In a preferred embodiment, the distributional types are flat, rising, falling, and vibrato; however, many more distributional types are possible. Also in a preferred embodiment of the invention, the class decisions are made by choosing the class with the minimum squared error between the class template vector and the fundamental frequency vector with elements {f₀[i], ∀i ∈ S_j^EV}. The mode is computed in frequency bins corresponding to quarter tones of the chromatic scale. The Pitch Quantization Stage accepts as input the median, mode, distributional type, and class confidence estimate and assigns a MIDI note number to the note. A given fundamental frequency in Hz is converted to a MIDI note number according to the formula

$$m = m_A + 12 \log_2\!\left(\frac{f_0}{f_A}\right) \qquad (7)$$

where m_A = 69 and f_A = 440 Hz. In the preferred embodiment, MIDI note numbers are assigned as follows. For flat distributions with high confidence, the MIDI note number is the nearest MIDI note integer to the mode. For rising and falling distributions, the MIDI note number is the nearest MIDI note integer to the median if the note duration is less than 7 frames, and the nearest MIDI note integer to the mode otherwise. For vibrato distributions, the MIDI note number is the nearest MIDI note integer to the mode.
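The conversion of Equation 7 and a simplified intranote pipeline can be sketched as below. This is not the patent's implementation: the sketch keeps only the median path of the Statistical Processing Stage (the mode computation and the distributional classification are omitted), and the function names are assumptions.

```python
import numpy as np

def hz_to_midi(f0_hz: float) -> float:
    """Equation (7): fundamental frequency in Hz to MIDI note number
    (A4 = MIDI 69 = 440 Hz)."""
    return 69.0 + 12.0 * np.log2(f0_hz / 440.0)

def intranote_pitch(f0, energy, voicing, e_thresh=0.1, v_thresh=0.5):
    """Stages 201-204, simplified: threshold by energy and voicing, then
    quantize the median of the surviving f0 estimates to a MIDI integer."""
    keep = (energy >= e_thresh) & (voicing >= v_thresh)   # stages 201-202
    f0_kept = np.asarray(f0)[keep]                        # S_j^EV estimates
    return int(round(hz_to_midi(np.median(f0_kept))))     # stages 203-204
```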

FIG. 6 shows the operation of the internote pitch assignment subsystem 180. The Internote Pitch Assignment Subsystem consists of two processing stages: the Key Finding Stage 207 and the Pairwise Correction Stage 206. The Key Finding Stage assigns the complete note sequence a scale in the Ionian or Aeolian mode, based on the distribution of Tonic, Mediant, and Dominant pitch relationships that occur in the sequence. A scale is created for each chromatic pitch class, that is, for C, C#, D, D#, E, F, F#, G, G#, A, A#, and B. Each pitch class is also assigned a probability weighted according to scale degree. For example, the first, sixth, eighth, and tenth scale degrees are given negative weights, and the zeroth (the tonic), second, fourth, fifth, seventh, and ninth are given positive weights. The zeroth, fourth, and seventh scale degrees are given additional weight because they form the tonic triad in a major scale.

The note sequence is compared to the scale with the highest probability as a template, and a degree of fit is calculated. In the preferred implementation, the measure of fit is calculated by scoring pitch occurrences of Tonic, Mediant, and Dominant pitch functions as interpreted by each scale. The scale with the highest number of Tonic, Mediant, and Dominant occurrences will have the highest score. The comparison may lead to a change of the MIDI note numbers of notes in the score that produce undesired differences. The differences are calculated in the Pairwise Correction Stage.
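The key-finding idea can be sketched as a weighted vote over chromatic scale degrees. The weight magnitudes below are illustrative assumptions; only their signs and the extra weight on the tonic triad follow the text, degrees 3 and 11 (which the text does not mention) are left neutral, and the function names are not from the patent.

```python
import numpy as np

def key_score(midi_notes, tonic_pc: int) -> float:
    """Score how well the notes fit the scale whose tonic is pitch class
    tonic_pc (0=C .. 11=B), per the scale-degree weighting described above."""
    degrees = (np.asarray(midi_notes) - tonic_pc) % 12
    weights = {0: 2.0, 4: 2.0, 7: 2.0,             # tonic triad: extra weight
               2: 1.0, 5: 1.0, 9: 1.0,             # other positive degrees
               1: -1.0, 6: -1.0, 8: -1.0, 10: -1.0}  # negative degrees
    return sum(weights.get(int(d), 0.0) for d in degrees)

def find_key(midi_notes) -> int:
    """Pick the pitch class whose scale scores highest for the sequence."""
    return max(range(12), key=lambda pc: key_score(midi_notes, pc))
```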

In the Pairwise Correction Stage, MIDI note numbers that do not fit the scale template are first examined. A rules-based decision tree is used to evaluate a pair of pitches: the nonconforming pitch and the pitch that precedes it. Such rule-based decision trees, based on species counterpoint voice-leading rules, are well known in the art and are described, for example, in D. Temperley, "The Cognition of Basic Musical Structures," The MIT Press, Cambridge, Mass., 2001, pp. 173-182. The rules are then used to evaluate the pair of notes consisting of the nonconforming pitch and the pitch that follows it. If both pairs conform to the rules, the nonconforming pitch is left unaltered. If the pairs do not conform to the rules, the nonconforming pitch is modified to fit within the assigned scale.

The corrected sequence is again examined to identify pairs that may not conform to the voice-leading rules. Pairs that do not conform are labeled dissonant and may be corrected. They are corrected if adjusting one note in the pair does not cause a dissonance (dissonance is defined by standard species counterpoint rules) in an adjacent pair either preceding or following the dissonant pair.

Each pair is then compared to the frequency ratios derived during the Pitch Quantization Stage. If a pair can be adjusted to reflect more accurately the ratio expressed by the corresponding pair of frequencies, it is so adjusted. In the preferred implementation, the adjustment is performed by raising or lowering a pitch from a pair if doing so does not cause a dissonance in an adjacent pair.

Computer System Implementation

FIG. 7 depicts a computer system 400 incorporating recording and note generation functions in place of the call handling and SMS handling, respectively, shown in FIG. 1. This is another preferred embodiment of the present invention. The computer system includes a central processing unit (CPU) 402, a user interface 404 (e.g., a standard computer interface with a monitor, keyboard, and mouse or similar pointing device), an audio signal interface 406, a network interface 408 or similar communications interface for transmitting and receiving signals to and from other computer systems, and memory 410 (which will typically include both volatile random access memory and non-volatile memory such as disk or flash memory).

The audio signal interface 406 includes a microphone 412, a low pass filter 414, and an analog to digital converter (ADC) 416 for receiving and preprocessing analog input signals. It also includes a speaker driver 420 (which includes a digital to analog signal converter and signal shaping circuitry commonly found in "computer sound boards") and an audio speaker 418.

The memory 410 stores an operating system 430, application programs, and the previously described signal processing modules. The other modules stored in the memory 410 have already been described above and are labeled with the same reference numbers as in the other figures.

Alternate Embodiments

While the present invention has been described with reference to a few specific embodiments, the description is illustrative of the invention and is not to be construed as limiting the invention. Various modifications may occur to those skilled in the art without departing from the true spirit and scope of the invention as defined by the appended claims.

For instance, the present invention could be embedded in a communication device, a stand-alone game device, or the like. Further, the input signal could be a live voice, an acoustic instrument, a prerecorded sound signal, or a synthetic source.

It is to be understood that the above-described embodiments are simply illustrative of the principles of the invention. Various other modifications and changes may be made by those skilled in the art which will embody the principles of the invention and fall within the spirit and scope thereof.

1. A method for generating an identification signal, comprising: accepting as input a monophonic audio signal of limited duration; translating said monophonic audio signal to a representation of a series of discrete tones; and producing a control signal from said representation of discrete tones, said control signal suitable for causing a transponder to generate a signal, where said generated signal is a translation of said monophonic audio signal; wherein translating said monophonic audio signal to the representation of the series of discrete tones includes segmenting the monophonic audio signal into a series of segments according to time varying features of the audio signal that include a feature associated with energy and a feature associated with spectral composition, wherein each tone in the series of discrete tones is associated with a different segment in the series of segments.
2. A method for generating an identification signal, comprising: accepting as input a voice signal of limited duration; translating said voice signal to a representation of a series of discrete tones; and producing a control signal from said representation of discrete tones, said control signal suitable for causing a transponder to generate a signal, where said generated signal is a translation of said voice signal; wherein translating said voice signal to the representation of the series of discrete tones includes segmenting the voice signal into a series of segments according to time varying features of the voice signal that include a feature associated with energy and a feature associated with spectral composition, wherein each tone in the series of discrete tones is associated with a different segment in the series of segments.
3. The method of claim 2 wherein said generated signal is melodically human-recognizable.
4. The method of claim 2 wherein said generated signal is rhythmically human-recognizable.
5. The method of claim 2 wherein accepting as input further comprises receiving said voice signal over a telephone connection.
6. The method of claim 5 wherein said telephone connection is wireless.
7. The method of claim 2 wherein said step of accepting as input further comprises receiving said voice signal over a microphone attached to a computer.
8. The method of claim 2 wherein said translating step further comprises translating said voice signal to a range of tones within the capability of a mobile telephone audio output synthesizer.
9. The method of claim 2 further comprising the step of transmitting said control signal to a tone-producing output device responsive to said control signal.
10. The method of claim 2 wherein said translating step further comprises: generating a digital representation of said voice signal; dividing said digitized signal into a plurality of frames; extracting analysis data from each said frame; and formatting said analysis data into a frame representation.
11. The method of claim 10 further comprising the step of segmenting said signal by counting instances of increased signal amplitude in said frames, and for each instance of increased amplitude, determining a change in each of pitch, energy, and spectral composition in a region around said instance of increased amplitude, whereby a segment is defined by a start frame having an instance of increased amplitude and an end frame is defined by changes in pitch, energy, and spectral composition in relation to selected thresholds.
12. The method of claim 10 wherein said translating step further comprises grouping said frames into a plurality of regions.
13. The method of claim 12 wherein each said region is determined from a count of consecutive upward short-term average change in cepstral-domain energy followed by a count of consecutive downward short-term average change in cepstral-domain energy.
14. The method of claim 12 further comprising the step of determining the existence of a candidate note start frame in each said region.
15. The method of claim 13 further comprising the step of determining a candidate note start frame in each said region as the last frame within said region in which the count of consecutive upward short-term average change in cepstral-domain energy is not zero.
16. The method of claim 14 further comprising the step of determining which regions of said plurality have a valid note start frame.
17. The method of claim 14, wherein determining a candidate note start frame further comprises the step of determining if the cepstral domain energy of a particular frame is greater than a cepstral domain energy threshold and a frame immediately before said particular frame was below said cepstral domain energy threshold.
18. The method of claim 14, wherein determining a candidate note start frame further comprises the step of determining whether a fundamental frequency range of a particular frame is above a fundamental frequency range threshold and whether an energy range for said particular frame is above an energy range threshold.
19. The method of claim 14, further comprising the step of determining a stop frame corresponding to each start frame.
20. The method of claim 15, further comprising the step of determining a stop frame by locating the first frame after a start frame in which cepstral energy is below said cepstral domain energy threshold.
21. The method of claim 20, further comprising the step of defining the stop frame as a frame between two and ten frames before a subsequent start frame if no frame having cepstral energy below said cepstral domain energy threshold is found.
22. The method of claim 19 further comprising the step of verifying each start and stop frame pair by determining whether a) average voicing probability is above a voicing probability threshold, b) average short-time energy is above an average short-time energy threshold, and c) average fundamental frequency is above an average fundamental frequency threshold.
23. The method of claim 2 wherein the feature associated with energy includes a time-domain energy.
24. The method of claim 2 wherein the feature associated with energy includes a cepstral-domain energy.
25. The method of claim 2 wherein the time varying features according to which the voice signal is segmented include at least two features associated with energy.
26. The method of claim 2 wherein the feature associated with spectral composition includes a cepstral coefficient.
27. The method of claim 2 wherein the time varying features according to which the voice signal is segmented further include a feature associated with periodicity.
28. The method of claim 27 wherein the feature associated with periodicity includes a fundamental frequency.
29. The method of claim 27 wherein the feature associated with periodicity includes a voicing probability.
30. Apparatus for generating an identification signal comprising: a voice signal receiver; a translator having as its input a voice signal received by said voice signal receiver and having as its output a representation of discrete tones where an audio presentation of said discrete tones would be human-recognizable as a translation of said voice signal; wherein the translator includes an estimation module with outputs of a time varying feature associated with each of energy and spectral composition from the voice signal and a segmentation module responsive to the time varying features with an output of a segmentation of the voice signal into a series of segments according to the time varying features, such that each in the series of output discrete tones is associated with a different segment in the series of segments.
31. The apparatus of claim 30 wherein said voice signal receiver comprises an analog telephone receiver.
32. The apparatus of claim 30 wherein said voice signal receiver further comprises a voice-to-digital signal transducer.
33. The apparatus of claim 30 wherein said voice signal receiver further comprises a recording device.
34. The apparatus of claim 30 wherein said translator further comprises a feature estimation module to determine values for at least one time-varying feature of said input signal.
35. The apparatus of claim 34 wherein said translator further comprises a pitch assignment module responsive to signal energy in each segment output by said segmentation module.
36. The apparatus of claim 34 wherein said feature estimation module further comprises a primary feature module, a secondary feature module, and a tertiary feature module.
37. The apparatus of claim 36 wherein said primary feature module determines a plurality of values for each of time-domain energy, fundamental frequency, cepstral-domain energy, and voicing probability.
38. The apparatus of claim 35 wherein said segmentation module further comprises a first-phase segmentation module and a second-phase segmentation module.
39. The apparatus of claim 38 wherein said first-phase segmentation module groups a plurality of successive frames of said input signal into at least one region in response to output of said feature estimation module.
40. The apparatus of claim 39 wherein said region is a plurality of frames in which a change in energy increases immediately followed by frames in which change in energy decreases.
41. The apparatus of claim 40 in which said region has a minimum number of frames.
42. The apparatus of claim 39 wherein said second-phase segmentation module determines if said at least one region has a valid note start frame and if so, determines a stop frame.
43. The apparatus of claim 42 wherein said second-phase segmentation module determines said valid note start frame in response to cepstral domain energy by determining whether a frame has a cepstral domain energy greater than a cepstral domain energy threshold preceded by a frame having a cepstral domain energy less than said cepstral domain energy threshold.
44. The apparatus of claim 42 wherein said second-phase segmentation module determines a valid note start frame if the fundamental frequency range exceeds a fundamental frequency range threshold and if the non-cepstral domain energy range exceeds an energy range threshold.
45. The apparatus of claim 39 further comprising a segmentation post-processor to verify said start and stop frame in response to average voicing probability, average short-time energy, and average fundamental frequency of said start and stop frame.
46. The apparatus of claim 35 wherein said pitch assignment module assigns an integer between 32 and 83, said integer corresponding to the MIDI note number for pitch.
47. The apparatus of claim 35 wherein said pitch assignment module comprises an intranote pitch assignment subsystem and an internote pitch assignment subsystem.
48. The apparatus of claim 47 wherein said internote pitch assignment subsystem corrects pitches determined by said intranote pitch assignment subsystem.
49. The apparatus of claim 48 wherein said internote pitch assignment subsystem further comprises a key finding stage to assign a scale to a note sequence output by said intranote pitch assignment subsystem.
50. The apparatus of claim 48 wherein said internote pitch assignment subsystem further comprises a pairwise correction stage to examine a pitch and its preceding pitch for conformity to voice-leading rules; if a pair is determined to be dissonant according to said voice-leading rules, the internote pitch assignment subsystem corrects the pitches of said pair if the pitch adjustment does not cause dissonance in an adjacent pair.